System and method for improving index performance through prefetching

ABSTRACT

The present invention provides a prefetch system for use with a cache memory associated with a database employing indices. In one embodiment, the prefetch system includes a search subsystem configured to prefetch cache lines containing an index of a node of a tree structure associated with the database. Additionally, the prefetch system also includes a scan subsystem configured to prefetch cache lines based on an index prefetch distance between first and second leaf nodes of the tree structure.

TECHNICAL FIELD OF THE INVENTION

The present invention is directed, in general, to database management and, more specifically, to a system and method for improving index performance through prefetching.

BACKGROUND OF THE INVENTION

As the gap between processor speed and both DRAM and disk speeds continues to grow exponentially, it is becoming increasingly important to make effective use of caches to achieve high performance on database management systems. Caching exists at multiple levels within modern memory hierarchies. Typically, two or more levels of SRAM serve as cache memories (or “caches,” for short) for the contents of main memory in DRAM, which in turn may serve as a cache memory for the contents of a disk. Database researchers have historically focused on the importance of this latter form of caching (also known as a “buffer pool”). However, recent studies have demonstrated that even with traditional disk-oriented databases, roughly 50% or more of execution time is often wasted due to SRAM cache misses. For main-memory databases, it is even clearer that SRAM cache performance is crucial. Hence, attention has been directed toward revisiting core database algorithms in an effort to make them more cache friendly.

Index structures are used extensively throughout database systems, and they are often implemented as B⁺-Trees. While database management systems perform several different operations that involve B⁺-Tree indices (e.g., selections, joins, etc.), these higher-level operations can be decomposed into two key lower-level access patterns. One of these is searching for a particular key, which involves descending from the root node to a leaf node, using binary search within a given node to determine which child pointer to follow. The other is scanning some portion of the index, which involves traversing the leaves (leaf nodes) through a linked-list structure for a non-clustered index. For clustered indices, one can directly scan the database table after searching for the starting key. While search time is the key factor in single value selections and nested loop index joins, scan time is the dominant effect in range selections.

An example of the cache performance of both search and scan on B⁺-Tree indices may be considered by simulating their performance using a memory subsystem comparable to that associated with a Compaq ES40. A search experiment may look up 100,000 random keys in a main-memory B⁺-Tree index after it has been bulkloaded with 10 million keys. A scan experiment performs 100 range scan operations starting at random keys, each of which scans through one million (key, tupleID) pairs retrieving the tupleID values. The results for shorter range scans (e.g., 1000-tuple scans) are similar. The B⁺-Tree node size is equal to the cache line size, which is 64 bytes in this example. The results may be broken down into the three categories of busy time, data cache stalls, and other stalls. Results of the experiment indicate that both search and scan accesses on B⁺-Tree indices spend a significant fraction of their time (i.e., 65% and 84%, respectively) stalled on data cache misses. Hence there appears to be considerable room for improvement.

In an effort to improve the cache performance of index searches for main-memory databases, two other types of index structures, “cache-sensitive search trees” (CSS-Trees) and “cache-sensitive B⁺-Trees” (CSB⁺-Trees), have been studied. The premise of these studies is the conventional wisdom that the optimal tree node size is equal to the natural data transfer size. This corresponds to the disk page size for disk-resident databases and the cache line size for main-memory databases. Because cache lines are roughly two orders of magnitude smaller than disk pages (e.g., 64 bytes vs. 4 Kbytes), the resulting index trees for main-memory databases are considerably deeper. Since the number of expensive cache misses is roughly proportional to the height of the tree, it would be desirable to somehow increase the effective fanout (also called the branching factor) of the tree without paying the price of the additional cache misses that this would normally imply.

This may be accomplished by restricting the data layout such that the location of each child node can be directly computed from the parent node's address (or a single pointer), thereby eliminating all or nearly all of the child pointers. Assuming that keys and pointers are the same size, this effectively doubles the fanout of cache-line-sized tree nodes, thus reducing the height of the tree and the number of cache misses. CSS-Trees eliminate all child pointers, but do not support incremental updates and therefore are only suitable for read-only environments. CSB⁺-Trees do support updates by retaining a single pointer per non-leaf node that points to a contiguous block of its children. Although CSB⁺-Trees outperform B⁺-Trees on searches, they still perform significantly worse on updates due to the overheads of keeping all children of a given node in sequential order within contiguous memory, especially during node splits.

For the same index search experiment, CSB⁺-Trees eliminate 20% of the data cache stall time relative to B⁺-Trees, resulting in an overall speedup of 1.15 for searches. While this is a significant improvement, over half of the remaining execution time is still lost to data cache misses. In addition, these search-oriented optimizations provide no benefit to scan accesses, which suffer even more from data cache misses.

Accordingly, what is needed in the art is a way to enhance the effectiveness and efficiency of database searches and scans.

SUMMARY OF THE INVENTION

To address the above-discussed deficiencies of the prior art, the present invention provides a prefetch system for use with a cache memory associated with a database employing indices. In one embodiment, the prefetch system includes a search subsystem configured to prefetch cache lines containing an index of a node of a tree structure associated with the database. Additionally, the prefetch system also includes a scan subsystem configured to prefetch cache lines based on an index prefetch distance between first and second leaf nodes of the tree structure.

In another aspect, the present invention provides a method of prefetching for use with a cache memory associated with a database employing indices. The method includes prefetching cache lines containing an index of a node of a tree structure associated with the database. The method also includes prefetching cache lines based on an index prefetch distance between first and second leaf nodes of the tree structure.

The present invention also provides, in yet another aspect, a database management system including a computer employing a central processing unit, a main memory containing a database employing indices and a cache memory associated with the central processing unit and the main memory. The database management system also includes a prefetch system for use with the cache memory that is coupled to the database employing indices. The prefetch system has a search subsystem that prefetches cache lines containing an index of a node of a tree structure associated with the database, and a scan subsystem that prefetches cache lines based on an index prefetch distance between first and second leaf nodes of the tree structure.

The foregoing has outlined, rather broadly, preferred and alternative features of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates a system diagram of a database management system constructed in accordance with the principles of the present invention;

FIG. 2 illustrates a block diagram of a prefetch system constructed in accordance with the principles of the present invention;

FIGS. 3A, 3B, and 3C collectively illustrate graphs showing cache misses versus cycle times for various search situations;

FIGS. 4A, 4B, and 4C collectively illustrate graphs showing cache misses versus cycle times for various scan situations;

FIG. 5A illustrates a block diagram showing an embodiment of a tree structure employing an independent and contiguous jump pointer array;

FIG. 5B illustrates a block diagram showing an embodiment of a tree structure employing a chunked, independent jump pointer array constructed in accordance with the principles of the present invention;

FIG. 6 illustrates a block diagram of an embodiment of a tree structure employing an internal jump pointer array constructed in accordance with the principles of the present invention; and

FIG. 7 illustrates a flow diagram of an embodiment of a method of prefetching constructed in accordance with the principles of the present invention.

DETAILED DESCRIPTION

Referring initially to FIG. 1, illustrated is a system diagram of a database management system, generally designated 100, constructed in accordance with the principles of the present invention. The database management system 100 includes a computer 105 having a central processing unit 110, a main memory 115 containing a database employing indices, a cache memory 120 associated with the central processing unit 110 and the main memory 115, and a prefetch system 125. The prefetch system 125 cooperates with the cache memory 120 that is coupled to the database employing indices and includes a search subsystem that prefetches cache lines containing an index of a node of a tree structure associated with the database. The prefetch system 125 also includes a scan subsystem that prefetches cache lines based on an index prefetch distance between first and second leaf nodes of the tree structure.

In an exemplary embodiment, the prefetch system 125 may be embodied entirely in a software configuration that resides in the computer 105. In an alternative embodiment, the prefetch system 125 may be embodied entirely in a hardware configuration that is associated with the computer 105. In yet another embodiment, the prefetch system 125 may have a configuration that is composed partially of software and partially of hardware that operate in concert to perform prefetching.

The computer 105 may provide mechanisms for coping with large cache miss latencies. It may allow multiple outstanding cache misses to be in flight simultaneously for the sake of exploiting parallelism within a memory hierarchy. For example, a computer system such as the Compaq ES40 supports 32 in-flight loads, 32 in-flight stores, and eight outstanding off-chip cache misses per processor. Also, its crossbar memory system supports 24 outstanding cache misses. Additionally, to help applications take full advantage of this parallelism, such computer systems also provide prefetch instructions, which enable software to move data into the cache memory before it is needed.

Prefetching may successfully hide much of the performance impact of cache misses by overlapping them with computation and with other misses, for both array-based and pointer-based program codes. The resulting performance gain may be quantified as the original execution time divided by the improved execution time. Therefore, for modern machines, it is not the number of cache misses that dictates performance, but rather the amount of exposed miss latency that cannot be successfully hidden through techniques such as prefetching.

In the illustrated embodiment of the present invention, Prefetching B⁺-Trees (pB⁺-Trees), which use prefetching to limit the exposed miss latency, are presented. Tree-based indices such as B⁺-Trees pose a major challenge for prefetching search and scan accesses since both access patterns suffer from a pointer-chasing problem. That is, the data dependencies through pointers make it difficult to prefetch sufficiently far ahead to appropriately limit the exposed miss latency. For index searches, pB⁺-Trees reduce this problem by having wider nodes than the natural data transfer size; for example, eight cache lines (or disk pages) may be employed versus one. These wider nodes reduce the height of the tree structure, thereby decreasing the number of expensive misses when going from a parent node to a child node.

By the appropriate use of prefetching, these advantageously wider tree nodes may be achieved at small additional cost, since all of the cache lines in a wider node can be fetched almost as quickly as the single cache line of a traditional node. To accelerate index scans, arrays of pointers are introduced to the B⁺-Tree leaf nodes, which allow prefetching arbitrarily far ahead, thereby hiding the normally expensive cache misses associated with traversing the leaves within the range. Of course, indices may be frequently updated. Insertion and deletion times typically decrease despite any overheads associated with maintaining the wider nodes and the arrays of pointers. Contrary to conventional wisdom, the optimal B⁺-Tree node size on a modern machine is often wider than the natural data transfer size, since prefetching allows fetching each piece of the node simultaneously.

In an alternative embodiment, the following advantages relative to CSB⁺-Trees may be provided. Better search performance is achieved due to an increase in the fanout by more than the factor of two that CSB⁺-Trees provide (e.g., a factor of eight in the illustrated embodiment). Enhanced performance is achieved on updates relative to B⁺-Trees, since an improved search speed more than offsets any increase in node split cost due to wider nodes. Also, fundamental changes are not required to the original B⁺-Tree data structures or algorithms. In addition, the approach is complementary to CSB⁺-Trees.

Typically, the pB⁺-Tree may effectively hide over 90% of the cache miss latency suffered by (non-clustered) index scans. This may result in a factor of 6.5 to 8.7 speedup over a range of scan lengths. Although the illustrated embodiment employs the context of a main-memory database, alternative embodiments are also applicable to hiding disk latency, in which case the prefetches will move data from a disk (such as a hard drive or floppy drive) into main memory.

Turning now to FIG. 2, illustrated is a block diagram of a prefetch system, generally designated 200, constructed in accordance with the principles of the present invention. In the illustrated embodiment, the prefetch system 200 may be employed with a cache memory that is associated with a database employing indices. The prefetch system 200 includes a search subsystem 205 and a scan subsystem 210. The search subsystem 205 is configured to prefetch cache lines containing an index of a node of a tree structure associated with the database. The scan subsystem 210 is configured to prefetch cache lines based on an index prefetch distance between first and second leaf nodes of the tree structure.

In the illustrated embodiment, the search subsystem 205 is configured to employ a binary search and to prefetch each cache line associated with a selected node of the tree structure. Alternatively, the search subsystem 205 may be configured to prefetch each cache line associated with selected nodes along a path from a root node to a leaf node of the tree structure. The scan subsystem 210 may determine the index prefetch distance employing an external jump pointer array, or it may be determined by an internal jump pointer array. In an exemplary embodiment, the internal jump pointer array is configured to employ at least two bottom non-leaf nodes of the tree structure.

During a B⁺-Tree search without prefetching, the process typically starts from the root of the tree structure and performs a binary search in each non-leaf node to determine which child to visit next. Upon reaching a leaf node, a final binary search returns the key position. At least one expensive cache miss may be expected to occur each time a level is traversed in the tree. Therefore, the number of cache misses is roughly proportional to the height of the tree (minus any nodes that might remain in the cache if the index is reused). In the absence of prefetching, when all cache misses are equally expensive and cannot be overlapped, making the tree nodes wider than the natural data transfer size (i.e., a cache line for main-memory databases or a disk page for disk-resident databases) actually reduces performance. This occurs since the number of additional cache misses at each node more than offsets the benefits of reducing the number of levels in the tree.

Turning now to FIGS. 3A, 3B, and 3C collectively, illustrated are graphs, generally designated 300A, 300B, 300C respectively, showing cache misses versus cycle times for various search situations. As an example, consider a main-memory B⁺-Tree holding 1000 keys where the cache line size is 64 bytes and the keys, child pointers, and tupleIDs are all four bytes. Limiting the node size to one cache line, the B⁺-Tree will contain at least four levels. FIG. 3A illustrates a resulting cache behavior where the four cache misses cost a total of 600 cycles on an exemplary Compaq ES40-based machine. Doubling the node width to two cache lines allows the height of the B⁺-Tree to be reduced to three levels. However, as seen in FIG. 3B, this may result in cache behavior having six cache misses, thereby increasing execution time by 50%.

With prefetching, it becomes possible to hide the latency of any miss whose address can be predicted sufficiently early. Returning to the example, prefetching the second half of each two-cache-line-wide tree node so that it is fetched in parallel with the first half (as shown in FIG. 3C) results in significantly better performance than the one-cache-line-wide nodes shown in FIG. 3A. The extent to which the misses may be overlapped depends upon the implementation details of a memory hierarchy. However, a current trend is toward supporting greater parallelism. In fact, with multiple caches, memory banks and crossbar interconnects, it may be possible to completely overlap multiple cache misses.

FIG. 3C may be illustrative of the timing for an exemplary Compaq ES40-based machine, where back-to-back misses to memory can be serviced once every 10 cycles. This is a small fraction of the overall 150-cycle miss latency. Even without perfect overlap of the misses, large performance gains may still potentially be achieved (a speedup of 1.25 in this example) by creating wider-than-normal B⁺-Tree nodes. Therefore, a primary aspect of the pB⁺-Tree design is to use prefetching to “create” nodes that are wider than the natural data transfer size, but where the entire miss penalty for each extra-wide node is comparable to that of an original B⁺-Tree node.

Modifications to the B⁺-Tree algorithm may be addressed by first considering a standard B⁺-Tree node structure. With reference to Table 1 for definition of notations, each non-leaf node includes some number, d>>1 (where >> indicates “much greater than”), of childptr fields, d−1 key fields, and one keynum field that records the number of keys stored in the node (at most d−1). Each leaf node includes d−1 key fields, d−1 tupleID fields, one keynum field, and one next-leaf field that points to the next leaf node in key order. A first modification is to store the keynum and all of the keys prior to any of the pointers or tupleIDs in a node. This simple layout optimization allows a binary search to proceed without waiting to fetch all the pointers. For the illustrated embodiment, salient changes in the search algorithm from a standard B⁺-Tree algorithm may be summarized as follows.

TABLE 1
Definition of Notations

  w         number of cache lines in an index node
  m         number of child pointers in a one-line-wide node
  N         number of (key, tupleID) pairs in an index
  d         number of child pointers in a non-leaf node (d = w × m)
  T₁        full latency of a cache miss
  T_(next)  latency of an additional, pipelined cache miss
  B         normalized memory bandwidth (B = T₁/T_(next))
  k         number of nodes to prefetch ahead
  c         number of cache lines in a jump-pointer array chunk

Before starting a binary search, all of the cache lines that comprise the node are prefetched. Since an index search is first performed to locate the position for an insertion, all of the nodes on the path from the root to the leaf are already in the cache before the real insertion phase. Therefore, the only additional cache misses are caused by newly allocated nodes, which are prefetched in their entirety before redistributing the keys. For deletions, a “lazy deletion” is performed. If more than one key is in the node, the key is simply deleted. It is only when the last key in a node is deleted that a redistribution of keys or deletion of the node is attempted. Since an index search is also performed prior to deletion, the entire root-to-leaf path is in the cache. Key redistribution is the only potential cause of additional cache misses. Therefore, when all keys in a node are deleted, the sibling node from which keys may be redistributed is prefetched.
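By way of illustration only, the following minimal sketch shows how such a wide node and its prefetch-then-search step might look, assuming 64-byte cache lines, 4-byte keys, 4-byte child references (modeled as 32-bit node indices to keep the 4-byte size on 64-bit machines), and eight-line nodes; all identifiers are illustrative rather than taken from the described embodiment, and GCC/Clang's __builtin_prefetch stands in for whatever prefetch instruction a particular machine provides.

    #include <cstdint>

    // Sizing under the stated assumptions: a W-line node holds D = 8*W
    // child references (d = w*m with m = 8) plus D-1 keys and a keynum.
    constexpr int LINE = 64;      // cache line size in bytes (assumed)
    constexpr int W    = 8;       // cache lines per node (assumed, w = 8)
    constexpr int D    = 8 * W;   // child references per node

    struct NonLeafNode {
        int32_t  keynum;          // number of keys stored (at most D - 1)
        int32_t  keys[D - 1];     // keynum and keys laid out BEFORE the
        uint32_t child[D];        // child references, per the layout change
    };

    // Issue a prefetch for every cache line of the node so that all W
    // lines are fetched in parallel rather than one miss at a time.
    inline void prefetch_node(const NonLeafNode* n) {
        const char* p = reinterpret_cast<const char*>(n);
        for (int i = 0; i < W; ++i)
            __builtin_prefetch(p + i * LINE, /*read*/ 0, /*keep cached*/ 3);
    }

    // One step of the descent: prefetch the whole node, then binary-search
    // its keys to determine which child to visit next.
    uint32_t search_step(const NonLeafNode* n, int32_t key) {
        prefetch_node(n);
        int lo = 0, hi = n->keynum;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (n->keys[mid] <= key) lo = mid + 1;
            else                     hi = mid;
        }
        return n->child[lo];      // index of the child subtree to descend into
    }

Because the keynum field and the keys occupy the leading cache lines, the binary search can begin as soon as the first lines arrive while the remaining lines stream in behind it.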

Prefetching may also be used to accelerate the bulkload of a B⁺-Tree. However, because this is expected to occur infrequently, attention has been focused on the more frequent operations of search, insertion and deletion. In the illustrated embodiment, fixed-size keys, tupleIDs, and pointers are considered for simplicity. It is also assumed that tupleIDs and pointers are the same size. One skilled in the pertinent art will realize that these conditions are only exemplary and that other conditions are well within the broad scope of the present invention.

As discussed earlier, search times are expected to improve through the scheme presented since it reduces the number of levels in the B⁺-Tree without significantly increasing the cost of accessing each level. Updates always begin with a search phase. The expensive operations typically occur either when a node becomes too full upon an insertion and must be split, or when a node becomes empty upon a deletion and keys must be redistributed. Although node splits and key redistributions are more costly with larger nodes, the relative frequency of these expensive events may typically decrease. Therefore, update performance is expected to be comparable to, or perhaps even better than, B⁺-Trees with single-line nodes.

The space overhead of the index is reduced with wider nodes. This is primarily due to the reduction in the number of non-leaf nodes. For a full tree, each leaf node contains d−1 (key, tupleID) pairs. The number of non-leaf nodes is dominated by the number of nodes in the level immediately above the leaf nodes and is therefore approximately equal to N/(d(d−1)). As the fanout d increases with wider nodes, the node size grows linearly but the number of non-leaf nodes decreases quadratically, resulting in a nearly linear decrease in the non-leaf space overhead.

There are two system parameters that typically affect a determination of an optimal node size, given prefetching. The first system parameter is the extent to which the memory subsystem can overlap multiple cache misses. This may be quantified as the latency of a full cache miss T₁ divided by the additional time until another pipelined cache miss T_(next) would also complete. This ratio (i.e., T₁/T_(next)) may be called the normalized bandwidth B. For example, in the Compaq ES40-based machine example, T₁=150 cycles, T_(next)=10 cycles, and therefore the normalized bandwidth B=15.

The larger the value of the normalized bandwidth B, the greater the ability of a system to overlap parallel accesses, and the greater the likelihood of benefiting from wider index nodes. In general, it may be expected that an optimal number of cache lines per node w_(optimal) will not exceed the normalized bandwidth B; beyond that point, the binary search could already have been completed and the search would be ready to move down to the next level in the tree. The second system parameter that potentially limits the optimal node size is the size of the cache, although in practice this does not appear to be a limitation given realistic values of the normalized bandwidth B.

Now consider a more quantitative analysis of an optimal node width w_(optimal). A pB⁺-Tree with N (key, tupleID) pairs contains at least $\left\lceil \log_{d}\left(\frac{N}{d-1}\right)\right\rceil + 1$

levels. Using the data layout optimization that places keys before child pointers, on average only about three-fourths of the cache lines in each node (i.e., ⌈3w/4⌉ lines) are read. Therefore, the average memory stall time for a search in a full tree may be expressed as: $\left(\left\lceil \log_{d}\left(\frac{N}{d-1}\right)\right\rceil + 1\right) \times \left(T_{1} + \left(\left\lceil \frac{3w}{4}\right\rceil - 1\right) \times T_{next}\right) = T_{next} \times \left(\left\lceil \log_{wm}\left(\frac{N}{wm-1}\right)\right\rceil + 1\right) \times \left(B + \left\lceil \frac{3w}{4}\right\rceil - 1\right)$

By computing the value of w that minimizes this cost, we can find the optimal node width w_(optimal). For example, in our simulations where m=8 and B=15, by averaging over tree sizes N=10³, . . . , 10⁹, it may be computed from the equation above that the optimal node width w_(optimal) equals 8. If the memory subsystem bandwidth increases such that B equals 50, then the optimal node width w_(optimal) increases to 22. When comparing pB⁺-Trees with conventional B⁺-Trees, better search performance, comparable or somewhat better update performance, and lower space overhead may be expected.
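The following sketch evaluates this cost model numerically under the stated parameters (m = 8, B = 15, tree sizes N = 10³ through 10⁹). It is a back-of-the-envelope calculation, not part of the described embodiment; under these assumptions it should report a minimizing width of eight lines, consistent with the text.

    #include <cmath>
    #include <cstdio>

    // Evaluate the average search stall cost (in units of T_next) for
    // candidate node widths w and report the minimizing width.
    int main() {
        const double m = 8.0, B = 15.0;
        int best_w = 1;
        double best_cost = 1e30;
        for (int w = 1; w <= 32; ++w) {
            const double d = w * m;                  // fanout of a w-line node
            double total = 0.0;
            int samples = 0;
            for (double N = 1e3; N <= 1e9; N *= 10, ++samples) {
                // levels = ceil(log_d(N/(d-1))) + 1
                double levels =
                    std::ceil(std::log(N / (d - 1)) / std::log(d)) + 1.0;
                // per-level stall, in units of T_next: B + ceil(3w/4) - 1
                double per_level = B + std::ceil(3.0 * w / 4.0) - 1.0;
                total += levels * per_level;
            }
            if (total / samples < best_cost) {
                best_cost = total / samples;
                best_w = w;
            }
        }
        std::printf("w_optimal = %d\n", best_w);     // 8 for m = 8, B = 15
        return 0;
    }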

Having addressed index search performance, index range scans will now be addressed. Given starting and ending keys as arguments, an index range scan returns a list of either the tupleIDs or the tuples themselves with keys that fall within this range. First, a starting key is searched in the B⁺-Tree to locate a starting leaf node. Then a scan follows the next-leaf pointers, visiting the leaf nodes in order. As the scan proceeds, the tupleIDs (or tuples) are copied into a return buffer. This process stops when either the ending key is found or the return buffer fills up. In the latter case, the scan procedure pauses and returns the buffer to the caller (often a join node in a query execution plan), which then consumes the data and resumes the scan where it left off. Therefore, range selection involves one key search and often multiple leaf node scan calls. Range selections that return tupleIDs will be specifically addressed, although returning the tuples themselves or other variations is a straightforward extension of the algorithm.

Turning now to FIGS. 4A, 4B, and 4C collectively, illustrated are graphs, generally designated 400A, 400B, 400C respectively, showing cache misses versus cycle times for various scan situations. In general and without prefetching, range scans may spend over 84% of their time stalled on data cache misses. FIG. 4A illustrates this situation, where a full cache miss latency is suffered for each leaf node. A partial solution is to make the leaf nodes multiple cache lines wide and prefetch each component of a leaf node in parallel. This may reduce the frequency of expensive cache misses, as illustrated in FIG. 4B. While this is helpful, the goal is to fully hide the miss latencies to the extent permitted by the memory system, as illustrated in FIG. 4C. To accomplish this goal, a pointer-chasing problem needs to be overcome.

Assuming that three nodes' worth of computation is needed to hide a miss latency, then when a node n_(i) is visited one would like to be launching a prefetch of a node n_(i+3). To compute the address of the node n_(i+3), the pointer chain would normally have to be followed through the nodes n_(i+1) and n_(i+2). However, this would incur a full miss latency to fetch the node n_(i+1) and then to fetch the node n_(i+2) before the prefetch of the node n_(i+3) could be launched, thereby defeating the goal of hiding the miss latency of n_(i+3). To overcome this, the concept of jump pointers, customized to the specific needs of B⁺-Tree indices, may be employed.

In an exemplary embodiment of the present invention, jump pointer arrays are separate arrays that store these jump pointers, rather than storing the jump pointers directly in the leaf nodes. The jump pointer arrays may also employ a back-pointer associated with a starting leaf node to locate the leaf's position within the jump pointer array. Then, based on a desired index prefetching distance, an array offset is adjusted to find the address of the appropriate leaf node to prefetch. As the scan proceeds, the prefetching task simply continues to walk ahead in the jump-pointer array (which itself is also prefetched) without having to dereference the actual leaf nodes again.

Jump-pointer arrays are more flexible than jump pointers stored directly in the leaf nodes. The prefetching distance may be adjusted by simply changing the offset used within the jump pointer array. This allows dynamic adaptation to changing performance conditions on a given machine, or if an associated software code is migrated to a different machine. In addition, the same jump-pointer array can be reused to target different latencies in the memory hierarchy (e.g., disk latency vs. memory latency). From an abstract perspective, one might think of the jump-pointer array as a single large, contiguous array, as will be discussed with respect to FIG. 5A below. This configuration may be efficient in read-only situations, but would typically create problems in other situations. A key issue in implementing jump-pointer arrays may involve updates.

Turning now to FIG. 5A, illustrated is a block diagram showing an embodiment of a tree structure, generally designated 500A, employing an independent and contiguous jump pointer array. The tree structure 500A includes a collection of non-leaf nodes collectively designated 505A, a collection of leaf nodes collectively designated 510A and a jump pointer array 515A having a single, contiguous array arrangement. This independent, single, contiguous jump pointer array 515A may create a problem during updates, however. When a leaf is deleted, an empty slot is typically left in the single contiguous array. When a new leaf is inserted, an empty slot needs to be created in the appropriate position for a new jump pointer. If no nearby empty slots can be located, this may potentially involve copying a very large amount of data within the single contiguous jump pointer array 515A to create the empty slot. In addition, for each jump pointer that is moved within the single contiguous jump pointer array 515A, the corresponding back-pointer from the leaf node into the array also needs to be updated, which may be very costly to performance.

Turning now to FIG. 5B, illustrated is a block diagram showing an embodiment of a tree structure, generally designated 500B, employing a chunked, independent jump pointer array constructed in accordance with the principles of the present invention. The tree structure 500B includes a collection of non-leaf nodes 505B, a collection of leaf nodes 510B and a chunked jump pointer array 515B arranged as a linked list of chunks with hint back-pointers. The chunked jump pointer array 515B allows several performance improvements over a contiguous jump pointer array such as that discussed with respect to FIG. 5A above.

First, breaking a contiguous array into a chunked linked list, as illustrated in FIG. 5B, limits the impact of an insertion to its corresponding chunk. Second, actively attempting to interleave empty slots within the chunked jump pointer array 515B allows insertions to proceed more quickly. During a bulkload or when a chunk splits, the jump pointers are stored such that empty slots are evenly distributed to maximize the chance of finding a nearby empty slot for insertion. When a jump pointer is deleted, it simply leaves an empty slot in the chunk.

Finally, the meaning of the back-pointer in a leaf node corresponding to its position in the jump-pointer array is altered such that it is merely a “hint.” The pointer points to the correct chunk, but the position within that chunk may be imprecise. Therefore, when moving jump pointers within a chunk to insert a new leaf address, there is no need to update the hints for the moved jump pointers. A hint field may be updated when the precise position in the jump-pointer array is looked up during a range scan or insertion.

In this case, the leaf node should already be in the cache and updating the hint requires minimal overhead. Additionally, a hint field may be updated when a chunk splits and addresses are redistributed; in that case, the hints must be updated to point to the new chunk. The cost of using hints, of course, is that in some cases the correct location within the chunk must be searched for. In practice, however, the hints appear to be good approximations of the true positions, and searching for the precise location is not a costly operation (e.g., it should not incur many, if any, cache misses).

In summary, the net effect of these enhancements is that nothing moves during deletions, and typically only a small number of jump pointers (between the insertion position and the nearest empty slot) move during insertions. In neither case do the hints within the leaf nodes normally need to be updated. Thus, these jump-pointer arrays are expected to perform well during updates.
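A minimal sketch of such a chunked jump-pointer array with hint back-pointers appears below; the chunk capacity of 32 slots and all identifiers are illustrative assumptions of this sketch rather than details of the described embodiment.

    // A chunk of the external jump-pointer array: a fixed-size block of
    // leaf addresses linked to the next chunk in key order. Empty slots
    // (nullptr) are interleaved at bulkload/split time to absorb
    // insertions cheaply.
    struct LeafNode;                       // defined with the index itself

    struct Chunk {
        static constexpr int SLOTS = 32;   // chunk capacity (assumed)
        LeafNode* jump[SLOTS];             // leaf addresses; nullptr = empty
        Chunk*    next;                    // next chunk in key order
    };

    // The hint stored in each leaf: the chunk is kept accurate, but the
    // slot index is only approximate and may be stale after pointers move.
    struct LeafHint {
        Chunk* chunk;
        int    slot;
    };

    // Resolve a hint to the true slot by probing outward from the hinted
    // position; in practice the hint is close, so few probes are needed.
    int resolve_hint(const LeafHint& h, const LeafNode* leaf) {
        for (int delta = 0; delta < Chunk::SLOTS; ++delta) {
            int lo = h.slot - delta, hi = h.slot + delta;
            if (lo >= 0 && h.chunk->jump[lo] == leaf) return lo;
            if (hi < Chunk::SLOTS && h.chunk->jump[hi] == leaf) return hi;
        }
        return -1;                         // not found (hint chunk stale)
    }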

Having described a data structure to facilitate prefetching, an exemplary embodiment of a prefetching algorithm may now be described. Recall that a basic range scan algorithm consists of a loop that visits a leaf on each iteration by following a next-leaf pointer. To support prefetching, prefetches are added both prior to this loop (for the startup phase) and inside the loop (for the steady-state phase). Let k be the desired prefetching distance, in units of leaf nodes (an approach for selecting k is discussed below). During the startup phase, prefetches are issued for the first k leaf nodes. These prefetches proceed in parallel, exploiting the available memory hierarchy bandwidth. During each loop iteration (i.e., in the steady-state phase) and prior to visiting the current leaf node in the range scan, the leaf node that is k nodes after the current leaf node is prefetched. The goal is to ensure that by the time the basic range scan loop is ready to visit a leaf node, that node has already been prefetched into the cache. With this framework in mind, further details of an exemplary embodiment may be described.

First, in the startup phase, it is advantageous to locate the jump pointer of the starting leaf within the jump-pointer array. To do this, follow the hint pointer from the starting leaf to see whether it is precise (i.e., whether the hint points to a pointer back to the starting leaf). If not, then search within the chunk in both directions relative to the hint position until the matching position is found. Usually, the distance between the hint and the actual position appears to be small in practice.

Second, prefetch the jump-pointer chunks as well as the leaf nodes, and handle empty slots in the chunks. During the startup phase, both the current chunk and the next chunk are prefetched. When looking for a jump pointer, test for and skip all empty slots. If the end of the current chunk is reached, go to the next chunk to get the first non-empty jump pointer (there is at least one non-empty jump pointer, or the chunk would have been deleted). Then, prefetch the next chunk ahead in the jump-pointer array. The next chunk is expected to be in the cache by the time it is accessed since it is always prefetched before prefetching any leaf nodes pointed to by the current chunk. Third, although the actual number of tupleIDs in the leaf node is unknown when range prefetching is done, it is assumed that the leaf is full and the return buffer area is prefetched accordingly. Thus, the return buffer will always be prefetched sufficiently early.

Selecting the index prefetching distance and the chunk size may now be addressed. A value for the index prefetching distance k (in units of nodes to prefetch ahead) may be selected as follows. Normally this quantity is derived by dividing the expected worst-case miss latency by the computation time spent on one leaf node. However, because the computation time associated with visiting a leaf node during a range scan is quite small relative to the miss latency, it will be assumed that the limiting factor is the memory bandwidth B. One may estimate this bandwidth-limited prefetching distance as k=B/w, where B is the normalized memory bandwidth and w is the number of cache lines per leaf node, as defined in Table 1. In practice, there is no problem with increasing k to create a performance margin, since any prefetches that cannot proceed are simply buffered within the memory system.
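Combining the pieces above, the range scan with jump-pointer prefetching might be sketched as follows. This reuses the W and LINE constants and the Chunk type from the earlier sketches; visit_leaf is a placeholder for whatever the surrounding scan code does with each leaf (e.g., copying tupleIDs into the return buffer), and the cursor is assumed to start at the position resolved from the starting leaf's hint.

    // Minimal leaf type for this sketch: only the next-leaf link matters here.
    struct LeafNode { LeafNode* next_leaf; /* keynum, keys, tupleIDs ... */ };

    // Cursor over the chunked jump-pointer array: yields successive leaf
    // addresses, skipping empty slots, and prefetches the chunk after the
    // next one whenever a chunk boundary is crossed.
    struct Cursor {
        Chunk* chunk;
        int    slot;
        LeafNode* next() {
            while (chunk) {
                while (slot < Chunk::SLOTS) {
                    if (LeafNode* leaf = chunk->jump[slot++]) return leaf;
                }
                chunk = chunk->next;                 // cross a chunk boundary
                slot  = 0;
                if (chunk && chunk->next)
                    __builtin_prefetch(chunk->next); // stay one chunk ahead
            }
            return nullptr;
        }
    };

    // Prefetch all W cache lines of a leaf node in parallel.
    void prefetch_leaf(const LeafNode* leaf) {
        const char* p = reinterpret_cast<const char*>(leaf);
        for (int i = 0; i < W; ++i)
            __builtin_prefetch(p + i * LINE, 0, 3);
    }

    bool visit_leaf(LeafNode* leaf);                 // placeholder; false = done

    // Startup phase: prefetch the first k leaves; steady state: before
    // visiting each leaf, prefetch the leaf k nodes ahead of it.
    void range_scan(LeafNode* start, Cursor cur, int bandwidth_B) {
        const int k = (bandwidth_B + W - 1) / W;     // k = ceil(B/w)
        for (int i = 0; i < k; ++i)
            if (LeafNode* ahead = cur.next()) prefetch_leaf(ahead);
        for (LeafNode* leaf = start; leaf; leaf = leaf->next_leaf) {
            if (LeafNode* ahead = cur.next()) prefetch_leaf(ahead);
            if (!visit_leaf(leaf)) break;
        }
    }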

In selecting a chunk size c, chunks must be sufficiently large that prefetching one chunk ahead hides the miss latency of accessing the chunks themselves. During the steady-state phase of a range scan, upon reaching a new chunk, the chunk after it is immediately prefetched, so that its fetch time can be overlapped with the time it takes to prefetch the leaf nodes associated with the current chunk. Since the memory hierarchy only has enough bandwidth to initiate B cache misses during the time it takes one cache miss to complete, the chunks will clearly be large enough to hide the latency of fetching the next chunk if they contain at least B leaf pointers (there is at least one cache line access for every leaf visit).

For a full tree with no empty leaf slots and no empty chunk slots, each cache line can hold 2m leaf pointers (since chunks contain only pointers and no keys). In this case, it can be estimated that the minimum chunk size c in units of cache lines is c = B/(2m); for example, with B=15 and m=8, c is just under one cache line. To account for empty chunk slots, the denominator (2m) may be multiplied by the occupancy of the chunk slots (a value similar to the bulkload factor), which would increase the value of the minimum chunk size c somewhat.

Another factor that may dictate the minimum chunk size c is that each chunk should contain at least k leaf pointers so that the prefetching algorithm may operate sufficiently far ahead. However, since the prefetching distance k is less than or equal to the normalized memory bandwidth B, the chunk size c in the equation above should be sufficient. Increasing the chunk size c beyond this minimum value to account for empty leaf nodes and empty chunk slots will typically improve performance, however. Given sufficient memory system bandwidth, the prefetching scheme of this exemplary embodiment tends to hide the full memory latency experienced at every leaf node visited during range scan operations. Additionally, good performance on updates is also anticipated.

However, there is a space overhead associated with employing a jump-pointer array. Since the jump pointer array contains only one pointer per leaf node, the space overhead is relatively small. Since a next-leaf pointer and a back-pointer are stored in every leaf, there are at most d−2 (key, tupleID) pairs in every leaf node (where d is defined in Table 1). So, the jump pointer for a full leaf node takes only 1/(2(d−2)) as much space as the leaf node itself. The resulting increase in the fanout d from creating wider B⁺-Tree nodes will help reduce this overhead. However, this space overhead may be reduced further.

Turning now to FIG. 6, illustrated is a block diagram of an embodiment of a tree structure, generally designated 600, employing an internal jump pointer array constructed in accordance with the principles of the present invention. The tree structure 600 includes a collection of non-leaf nodes 605, a collection of leaf nodes 610 and a collection of internal jump pointer arrays 615. In the preceding embodiments, discussions and examples were presented that described how a jump-pointer array may be implemented by creating a new external structure to store the jump pointers (as illustrated earlier with respect to FIGS. 5A and 5B).

However, there is an existing structure within a B⁺-Tree, for example, that already contains pointers to the leaf nodes, namely, the parents of the leaf nodes. These parent nodes may be called bottom non-leaf nodes. Child pointers within a bottom non-leaf node correspond to the jump pointers within a chunk of the external jump-pointer array as was described above. A key difference, however, is that there is typically no easy way to traverse these bottom non-leaf nodes quickly enough to perform prefetching. A potential solution is to connect these bottom non-leaf nodes together in leaf key order using linked-list pointers. FIG. 6 illustrates this concept as the internal jump-pointer arrays 615. It may be noted that the leaf nodes do not contain back-pointers to their positions within their parents. However, such pointers are not necessary for this internal implementation, since the position is determined during the search for the starting key. Simply retaining the result of the binary search of the bottom non-leaf nodes will produce the position to appropriately initiate the prefetching operation.

This approach is attractive with respect to space overhead, since the only overhead is one additional pointer per bottom non-leaf node. The overhead of updating this pointer may be substantially insignificant, since it only needs to be changed in the rare event that a bottom non-leaf node splits or is deleted. A potential limitation of this approach, however, is that the length of a “chunk” in this embodiment of a jump-pointer array is dictated by the B⁺-Tree structure and may not be easily adjusted to satisfy large prefetch distance requirements (e.g., for hiding disk latencies).
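A sketch of this internal variant, reusing the illustrative types and constants above, might add only the single linked-list pointer per bottom non-leaf node and walk the parents' child arrays directly; again, the identifiers are assumptions of this sketch rather than elements of the described embodiment.

    // A bottom non-leaf node doubles as a jump-pointer "chunk": its child
    // array already holds leaf addresses, and one added next_bottom pointer
    // links these nodes in leaf key order.
    struct BottomNonLeaf {
        int32_t        keynum;         // keys in use (children in use = keynum + 1)
        int32_t        keys[D - 1];
        LeafNode*      child[D];       // leaf addresses; these are the jump pointers
        BottomNonLeaf* next_bottom;    // the single added linked-list pointer
    };

    // Yield successive leaf addresses for prefetching. The starting (node,
    // pos) pair comes directly from the binary search for the starting key,
    // so no back-pointer hint is needed in the leaves.
    LeafNode* next_leaf_to_prefetch(BottomNonLeaf*& node, int& pos) {
        while (node) {
            if (pos <= node->keynum)   // child slots 0..keynum are valid
                return node->child[pos++];
            node = node->next_bottom;  // advance to the next parent in key order
            pos  = 0;
            if (node)
                __builtin_prefetch(node);  // fetch the parent itself as we go
        }
        return nullptr;
    }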

Turning now to FIG. 7, illustrated is a flow diagram of an embodiment of a method of prefetching, generally designated 700, constructed in accordance with the principles of the present invention. The method 700, for use with a cache memory associated with a database employing indices, starts in a step 705 and proceeds to a first decisional step 710. The first decisional step 710 determines if a database search is to be performed, and the method 700 proceeds to a step 715 if a search is to be performed.

A search includes prefetching cache lines containing an index of a node of a tree structure associated with the database. The step 715 identifies the index nodes whose associated cache lines are to be prefetched. In the illustrated embodiment of the method 700, a binary search is performed and each cache line associated with selected nodes along a path from a root node to a leaf node of the tree structure may be considered for prefetching. In an alternative embodiment, each cache line associated with a selected node of the tree structure may be considered for prefetching. Then, in a step 720, the appropriate cache lines are prefetched and the method 700 ends in a step 725.

Returning to the first decisional step 710, for the case where a database search is not to be performed, a second decisional step 730 determines if a database scan is to be performed. Performing a database scan includes prefetching cache lines based on an index prefetch distance between first and second leaf nodes of the tree structure. The index prefetch distance is determined in a step 735 employing an external jump pointer array or an internal jump pointer array, wherein the internal jump pointer array uses at least two bottom non-leaf nodes of the tree structure. Then, in a step 740, the appropriate cache lines are prefetched and the method 700 ends in the step 725. If a database scan is not to be performed in the second decisional step 730, the method 700 returns to the first decisional step 710.

In summary, several embodiments of a prefetch system for use with a cache memory that is associated with a database employing indices have been provided. Additionally, embodiments of a database management system employing the prefetch system and a method of prefetching have also been provided. In general, the prefetch system and method of prefetching accelerate the two important operations of searches and range scans on B⁺-Tree indices. To accelerate searches, pB⁺-Trees use prefetching to effectively create nodes wider than the natural data transfer size (e.g., eight cache lines or disk pages vs. one). These wider nodes reduce the height of the B⁺-Tree, thereby decreasing the number of expensive misses when going from parent to child without significantly increasing the cost of fetching a given node.

The results of the embodiments presented indicate that for an index search, the prefetch system may achieve a speedup of 1.27 to 1.55 over the B⁺-Tree by decreasing the height of the tree. Additionally, for an index scan, the prefetch system may achieve a speedup of 3.5 to 3.7 over the B⁺-Tree, again due to the faster search and wider nodes. Moreover, jump-pointer arrays were proposed, which enable effective range scan prefetching across node boundaries. Overall, the pB⁺-Tree achieves a speedup of about 6.5 to 8.7 over the B⁺-Tree for range scans. For index updates (insertions and deletions), the technique may achieve a speedup of 1.24 to 1.52 over the B⁺-Tree, due to faster search and less frequent node splits with wider nodes. Of course, application of the principles of the present invention to other current or future developed tree structures is well within the broad scope of the present invention.

Although the present invention has been described in detail, those skilled in the art should understand that they can make various changes, substitutions and alterations herein without departing from the spirit and scope of the invention in its broadest form.

What is claimed is:
1. A method of prefetching for use with a cache memory associated with a database employing indices, comprising: prefetching cache lines containing an index of a node of a tree structure associated with said database; and prefetching at least one cache line associated with a second leaf node of said tree structure that is located an index prefetch distance from a first leaf node of said tree structure.
2. The method as recited in claim 1 wherein said prefetching comprises prefetching each cache line associated with a selected node of said tree structure.
3. The method as recited in claim 1 wherein said prefetching comprises prefetching each cache line associated with selected nodes along a path from a root node to a leaf node of said tree structure.
4. The method as recited in claim 1 wherein said prefetching employs a binary search without waiting to fetch pointers of nodes of said tree structure.
5. The method as recited in claim 1 wherein said index prefetch distance is determined by an external or internal jump pointer array.
6. The method as recited in claim 5 wherein said internal jump pointer array employs at least two bottom non-leaf nodes of said tree structure.
7. The method as recited in claim 1 wherein said index prefetch distance is limited by a bandwidth of a memory associated with said prefetch system.
8. A prefetch system for use with a cache memory associated with a database employing indices, comprising: a search subsystem configured to prefetch cache lines containing an index of a node of a tree structure associated with said database; and a scan subsystem configured to prefetch at least one cache line associated with a second leaf node of said tree structure that is located an index prefetch distance from a first leaf node of said tree structure.
9. The prefetch system as recited in claim 8 wherein said search subsystem is configured to prefetch each cache line associated with a selected node of said tree structure.
10. The prefetch system as recited in claim 8 wherein said search subsystem is configured to prefetch each cache line associated with selected nodes along a path from a root node to a leaf node of said tree structure.
11. The prefetch system as recited in claim 8 wherein said search subsystem is configured to employ a binary search without waiting to fetch pointers of nodes of said tree structure.
12. The prefetch system as recited in claim 8 wherein said index prefetch distance is determined by an external or internal jump pointer array.
13. The prefetch system as recited in claim 12 wherein said internal jump pointer array is configured to employ at least two bottom non-leaf nodes of said tree structure.
14. The prefetch system as recited in claim 8 wherein said index prefetch distance is limited by a bandwidth of a memory associated with said prefetch system.
15. A database management system, comprising: a computer employing a central processing unit; a main memory containing a database employing indices; a cache memory associated with said central processing unit and said main memory; and a prefetch system for use with said cache memory that is coupled to said database employing indices, including: a search subsystem that prefetches cache lines containing an index of a node of a tree structure associated with said database; and a scan subsystem that prefetches at least one cache line associated with a second leaf node of said tree structure that is located an index prefetch distance from a first leaf node of said tree structure.
16. The database management system as recited in claim 15 wherein said search subsystem prefetches each cache line associated with a selected node of said tree structure.
17. The database management system as recited in claim 15 wherein said search subsystem prefetches each cache line associated with selected nodes along a path from a root node to a leaf node of said tree structure.
18. The database management system as recited in claim 15 wherein said search subsystem employs a binary search without waiting to fetch pointers of nodes of said tree structure.
19. The database management system as recited in claim 15 wherein said index prefetch distance is determined by an external or internal jump pointer array.
20. The database management system as recited in claim 19 wherein said internal jump pointer array employs at least two bottom non-leaf nodes of said tree structure.
21. The database management system as recited in claim 15 wherein said index prefetch distance is limited by a bandwidth of said main memory.